library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.2.5
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.2.5
## Warning: package 'tibble' was built under R version 3.2.5
## Warning: package 'tidyr' was built under R version 3.2.5
## Warning: package 'purrr' was built under R version 3.2.5
## Warning: package 'dplyr' was built under R version 3.2.5
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy))
ggplot(data = mpg)
This produces an empty grey panel with nothing else drawn, because no geom layer has been added.
2.
dim(mtcars)
## [1] 32 11
mtcars has 32 rows and 11 columns.
3.
?mpg
drv describes the drive type: front-wheel (f), rear-wheel (r), or four-wheel (4) drive.
4.
ggplot(data = mpg) + geom_point(mapping = aes(x=cyl, y=hwy))
5.
ggplot(data = mpg) + geom_point(mapping = aes(x=class, y=drv))
Neither variable in this plot tells us anything about mileage. Class relates to vehicle size, but this plot cannot show whether class affects mileage. That said, the plot does show which classes come with which kind of drive, which is useful for other questions.
ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy, color=class))
ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=class, color=cty))
ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=class, size=cty))
With a continuous variable, color behaves continuously: different shades of a color indicate different values of the variable. The same is true of size. Shape, by contrast, won’t even let you map a continuous variable to it, which honestly makes a lot of sense.
4.
ggplot(data = mpg) + geom_point(mapping = aes(x=class, y=cty, color=cty))
Both aesthetics are used. In this case cty determines both the position on the y axis and the shade of the color.
5.
?geom_point
ggplot(data = mpg) + geom_point(mapping = aes(x=class, y=cty, stroke=(displ/2)))
Stroke alters the width of a point’s border and therefore requires a numeric value.
6.
ggplot(data = mpg) + geom_point(mapping = aes(x=class, y=cty, color=(displ<5)))
ggplot2 evaluates the logical expression for each row and maps the result to the aesthetic. In the code above two colors are used: one for FALSE and one for TRUE.
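As a small illustration of how a logical expression becomes a two-level color scale, the sketch below assigns explicit colors to the two outcomes. The color names and the legend title are arbitrary choices for illustration, not part of the exercise.

```r
library(ggplot2)

# displ < 5 is evaluated row by row and yields TRUE or FALSE,
# so the color scale is discrete with two levels.
p_logical <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = class, y = cty, color = displ < 5)) +
  # Illustrative colors; any two valid color names would do.
  scale_color_manual(values = c("TRUE" = "steelblue", "FALSE" = "grey50")) +
  labs(color = "displ < 5")
p_logical
```

Naming the values by "TRUE" and "FALSE" ties each color to an outcome rather than to the order in which the levels happen to appear.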
1.
plot <- ggplot(data = mpg)
plot <- plot + geom_point(mapping = aes(x=displ, y=class))
plot <- plot + facet_wrap(~cty)
plot
You get a lot of facets: facet_wrap() treats each distinct value of the continuous variable as its own level, so there is one panel per unique value of cty.
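The number of panels can be checked directly against the data. The sketch below counts the distinct values of cty and then, as one possible alternative, bins cty before faceting. The bin width of 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Number of distinct cty values in the data
n_values <- length(unique(mpg$cty))
n_values

# Faceting on cty directly gives one panel per distinct value.
p_raw <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = class)) +
  facet_wrap(~ cty)

# Optional: bin cty into intervals of width 5 before faceting
# (the width is an assumption made for this example).
p_binned <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = class)) +
  facet_wrap(~ cut_width(cty, 5))
p_binned
```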
plot1 <- ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
plot2 <- ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
plot1
plot2
The empty panels are combinations of drv and cyl that have no observations. For example, there are no four-wheel-drive cars with 5-cylinder engines. These correspond to the grid positions in the second plot where no points lie.
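The same information can be read off a cross-tabulation, which may be easier to scan than either plot. A minimal sketch using base R's table(); the object name combos is just a label chosen here.

```r
library(ggplot2)  # provides the mpg dataset

# Cross-tabulate drive type against cylinder count.
# Cells equal to 0 are combinations with no cars, i.e. the
# empty panels in facet_grid(drv ~ cyl).
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# The four-wheel-drive / 5-cylinder combination
combos["4", "5"]
```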
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
The first plot splits the panels into rows (stacked along the y axis) by drv, whereas the second splits them into columns (side by side along the x axis) by cyl. The ‘.’ acts as a placeholder for "no variable" on that side of the facet_grid() formula. In this way the first plot can be read as drv by NULL, while the second can be read as NULL by cyl.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
Faceting allows better visualization of each class on its own: you get a better sense of the distribution of fuel economy within each class. Using the color aesthetic gives a better view of the dataset as a whole: how the classes compare and how fuel economy varies across all of the classes. More data means a messier plot if you don’t break up the classes. If you want to visualize a large dataset without faceting, you may want to lower the alpha value to better show where the points are dense.
?facet_wrap
nrow and ncol define the number of rows and columns used to lay out the panels. scales sets whether the axis scales are fixed or free; it can also free only the x axis or only the y axis. labeller changes the strip labels, and shrink fits the scales to the output of the statistics rather than to the raw data. as.table changes the ordering of the panels: the highest value is placed at the bottom-right (TRUE) or at the top-right (FALSE). switch can move the labels from the top and right to the bottom and left, respectively, or both. dir sets the direction to horizontal ("h") or vertical ("v"), and strip.position places the labels at the top, bottom, left, or right.
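A short sketch combining a few of these layout arguments on the mpg data. The particular values (four columns, free y scales, vertical fill order) are arbitrary and only meant to show where each argument goes.

```r
library(ggplot2)

p_layout <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(
    ~ class,
    ncol = 4,            # number of columns of panels
    scales = "free_y",   # each panel gets its own y-axis range
    dir = "v"            # fill the layout column by column
  )
p_layout
```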
?facet_grid
facet_grid() takes row and column faceting variables, and its number of rows and columns is determined by the number of distinct values of those variables.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
Since color = drv is set as a global mapping, geom_smooth() inherits it and draws three lines, one per drive type.
3.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point(show.legend=FALSE)
show.legend = FALSE removes the legend that keys each color to its kind of drive. If you remove the argument, the legend is shown. I can’t find the exact example that used show.legend = FALSE, but I’d guess it was used so that we had to predict how the data were split by color.
4.
?geom_smooth
The se argument sets whether the confidence interval around the smoothed line is displayed.
5.
No, the two graphs shouldn’t look different. The first sets x and y as global mappings to be inherited by the geoms added later, whereas the second explicitly maps x and y in each geom. However, they use the same x and y, so the two graphs are essentially identical.
6.
#1
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
#2
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE, aes(group=drv))
## `geom_smooth()` using method = 'loess'
#3
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color=drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
#4
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(aes(color=drv)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
#5
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(aes(color=drv)) +
geom_smooth(se = FALSE, aes(linetype=drv))
## `geom_smooth()` using method = 'loess'
#6
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(aes(fill=drv),shape = 21, color = "white", size=2.5, stroke=2.5)
?stat_summary
The default geom is geom_pointrange.
?geom_pointrange
ggplot(data = diamonds) +
geom_pointrange(
mapping = aes(x = cut, y = depth),
stat = "summary",
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
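In ggplot2 3.3.0 and later, the summary arguments used above are named fun, fun.min and fun.max, and the fun.y family is deprecated. A sketch of the same plot with the newer names, assuming such a version is installed:

```r
library(ggplot2)

# Assumes ggplot2 >= 3.3.0, where the summary functions are
# passed as fun, fun.min and fun.max.
p_summary <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun = median,    # point: median depth per cut
    fun.min = min,   # lower end of the range
    fun.max = max    # upper end of the range
  )
p_summary
```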
?geom_col
?geom_bar
geom_col(), unlike geom_bar(), does not make the height of a bar proportional to the number of cases in each group. Instead, the height of each bar represents a value in the data, i.e. it requires a y value whereas geom_bar() does not.
3.
|Geom     |Stat    |Stat’s Geom|
|---------|--------|-----------|
|point    |identity|point      |
|line     |identity|point      |
|bar      |count   |bar        |
|linerange|identity|point      |
|count    |sum     |point      |
|density  |density |area       |
|violin   |ydensity|violin     |
|dotplot  |density |area       |
|hex      |binhex  |hex        |
4.
?stat_smooth
stat_smooth() computes y (the predicted value), ymin (the lower bound of the confidence interval), ymax (the upper bound of the confidence interval), and se (the standard error). The arguments method, se, n, span, method.args and formula all alter its output.
5.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 20))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
?geom_bar
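Without a constant group (such as group = 1), the bars are grouped by the x values, so each level of cut is its own group. The proportion is then computed within each group, and every bar comes out at 1. Setting group to a constant puts all of the rows in a single group, so each bar’s proportion is relative to the whole dataset.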
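To see what the group aesthetic does to the computed proportions, the values produced by the stat can be inspected with layer_data(). This is a sketch that assumes ggplot2 3.3.0 or later for after_stat(); older versions write the same thing as ..prop.., as in the code above.

```r
library(ggplot2)

# Assumes ggplot2 >= 3.3.0 for after_stat().
p_default <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

p_one_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# Default grouping: one group per cut, so every proportion is 1.
layer_data(p_default)$prop

# Single group: proportions are shares of the whole and sum to 1.
sum(layer_data(p_one_group)$prop)
```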
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
There’s a lot of overplotting in the first plot. Adding some jitter gives a better idea of where the density lies.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
2.
width and height set the amount of horizontal and vertical jitter, and each defaults to 40% of the resolution of the data.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(position = position_jitter(width=.4, height =0))
3.
geom_jitter() uses stat = "identity" and adds a small, controllable amount of random variation to each point. It seems more useful for judging the density of a distribution than for comparisons (e.g. scatter plots with overplotting or discrete positions).
geom_count() uses stat = "sum" and counts the number of observations at each location. It seems more useful for comparative data than for point data (e.g. cases where visual proportions matter, as in bar graphs).
4.
?geom_boxplot
ggplot(data=mpg, mapping=aes(x=class, y=hwy, fill=drv)) +
geom_boxplot()
ggplot(data=diamonds, aes(x=factor(1), fill=factor(cut))) +
geom_bar(width=1) +
coord_polar(theta="y")
2.
?labs
labs() allows you to change labels such as the x-axis, y-axis and title labels.
3.
?coord_quickmap
?coord_map
Map projections do not, in general, preserve straight lines, so coord_map() requires considerable computation. coord_quickmap() is a quick approximation that does preserve straight lines. It works best for smaller areas closer to the equator.
4.
?coord_fixed
coord_fixed() fixes the ratio of x units to y units. The default is 1:1; the ratio can be changed, but it is not in this example.
?geom_abline
geom_abline() adds a diagonal reference line, in this case passing through x = y = 10 and continuing through x = y = 30 and onwards. Together these two features help define the visual relationship between cty and hwy on the chart and make it easier to analyze visually. They also make it clear that hwy is consistently greater than cty.
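The plot being discussed is not reproduced above. A sketch of it, assuming it is the cty versus hwy scatterplot with geom_abline() and coord_fixed() both left at their defaults:

```r
library(ggplot2)

p_ref <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +  # defaults: intercept 0, slope 1, i.e. the line y = x
  coord_fixed()    # default ratio of 1 between y and x units
p_ref

# Share of cars whose highway mileage exceeds their city mileage
mean(mpg$hwy > mpg$cty)
```

With a fixed 1:1 ratio the reference line sits at 45 degrees, so points above it are cars that do better on the highway than in the city.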